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Abstract 

We analyze and exploit some scaling properties of the Affinity Propagation (AP) 
clustering algorithm proposed by Frey and Dueck (2007). First we observe that 
a divide and conquer strategy, used on a large data set hierarchically reduces the 
complexity 0{N'^) to 0(7V(''+2)/('»+i)), for a data-set of size N and a depth h 
of the hierarchical strategy. For a data-set embedded in a d-dimensional space, 
we show that this is obtained without notably damaging the precision except in 
dimension d = 2. In fact, for d larger than 2 the relative loss in precision scales like 
j^^{2-d)/ih+i)d^ Finally, under some conditions we observe that there is a value s* of 
the penalty coefficient, a free parameter used to fix the number of clusters, which 
separates a fragmentation phase (for s < s*) from a coalescent one (for s > s*) of 
the underlying hidden cluster structure. At this precise point holds a self-similarity 
property which can be exploited by the hierarchical strategy to actually locate its 
position. From this observation, a strategy based on AP can be defined to find out 
how many clusters are present in a given dataset. 



1 Introduction 

Since its invention by J. Pearl [1] in the context of Bayesian inference, it has been 
realized that the belief-propagation algorithm was related to many other algorithms 
encountered in various fields [2] and it has since diffused in many different areas 
(inference problems, signal processing, error codes, image segmentation ...). In 
the context of statistical physics, it is closely related to a certain type of mean-field 
approach (Bcthe-Peierls), more precisely the so-called Bethe-approximation, valid 
on sparse graphs 0j. This reconsideration in statistical physics terms, has given 
rise to a new generation of distributed algorithms. These address NP-hard com- 
binatorial optimization problems, like the survey-propagation algorithm of Mczard 
and Zecchina [J for the random K-SAT problems, where the factor-graph has a 
tree-like structure. Surprisingly enough, in some other context it works also well on 
dense factor-graphs as exemplified by the affinity propagation algorithm proposed 
by Frey and Dueck [S] for the clustering problem, which is also NP-hard. Akin 
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2 Introduction to belief-propagation 



if-centers, AP maps each data item onto an actual data item, called exemplar, 
and all items mapped onto the same exemplar form one cluster. Contrasting with 
if-centers, AP builds quasi-optimal clusters in terms of distortion, thus enforcing 
the cluster stability [Q. The price to pay for these understandability and stability 
properties is a quadratic computational complexity, except if the similarity matrix 
is made sparse with help of a pruning procedure. Nevertheless, a pre-treatment of 
the data would also be quadratic in the number if item, which is severely hinder- 
ing the usage of AP on large scale datasets. The basic assumption behind AP, is 
that cluster are of spherical shape. This limiting assumption has actually been ad- 
dressed by Leone and co-authors in [HI [7] , by softening a hard constraint present in 
AP, which impose that any exemplar has first to point to itself as oneself exemplar. 
Another drawback, which is actually common to most clustering techniques, is that 
there is a free parameter to fix which ultimately determines the number of clusters. 
Some methods based on EM [HI [5] or on information-theoretic consideration have 
been proposed [TP], but mainly use a precise parametrization of the cluster model. 
There exist also a different strategy based on similarity statistics [TT], that have 
been already recently combined with AP [12], at the expense of a quadratic price. 
In an earlier work [13( I14j , a hierarchical approach, based on a divide and conquer 
strategy was proposed to decrease the AP complexity and adapt AP to the context 
of Data Streaming. In this paper we extend the scaling analysis of this procedure 
initiated in [14j and propose a way to determine the number of clusters. 

The paper is organized as follows. In Section [5] we start from a brief description of 
BP and some of its properties. We sec how AP can be derived from it and present 
some extensions to AP, including the soft-constraint affinity propagation extension 
(SCAP) to AP. In Section[31 the computational complexity of Hi-AP is analyzed 
and the leading behavior, of the resulting error measured on the distribution of 
exemplars, which depends on the dimension and on the size of the subsets, is com- 
puted. Based on these results we enforce the self-similarity of Hi- AP in SectionjUto 
develop a renormalized version of AP (in the statistical physics sense). We finally 
discuss how to fix in a self-consistent way the penalty coefficient present in AP, 
which is conjugate to the number of cluster. 

2 Introduction to belief-propagation 
2.1 Local marginal computation 

The belief propagation algorithm is intended to computing marginals of joint- 
probability measure of the type 

P{^)^l[MXa)l[Hx^), (2.1) 

a i 

where x = (zi, . . . ,xn) is a set of variables, Xa = {xi, i £ a} el subset of variables 
involved in the factor ■00, while the 0i's are single variable factors. The structure 
of the joint measure Pa is conveniently represented by a factor graph [2], i.e. a 
bipartite graph with two set of vertices, !F associated to the factors, and V associated 
to the variables, and a set of edges £ connecting the variables to their factors (see 
Figure l^m Computing the single variables marginals scales in general exponentially 
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with the size of the system, except when the underlying factor graph has a tree Uke 
structure. In that case all the single site marginals may be computed at once, by 
solving the following iterative scheme due to J. Pearl [T]: 

ma—,i{xi) is called the message sent by factor node a to variable node i, while 
rii^aixi) is the message sent by variable node i to a. These quantities would actually 
appear as intermediate computations terms, while deconditioning (|2.ip . On a singly 
connected factor graph, starting from the leaves, two sweeps are sufficient to obtain 
the fixed points messages, and the beliefs (the local marginals) are then obtained 
from these sets of messages using the formulas: 

H^i) = ■^4>i{Xi)Y\_'^a-,i{Xi) 
haiXa) = -^1pa{Xa)Y\_ni^a{Xi) 

On a multiply connected graph, this scheme can be used as an approximate proce- 
dure to compute the marginals, still reliable on sparse factor graph, while avoiding 
the exponential complexity of an exact procedure. Many connections with mean 
field approaches of statistical physics have been recently unravelled, in particular 
the connection with the TAP equations introduced in the context of spin glasses 
|15j . and the Bethe approximation of the free energy which we detail now. 




b 



Fig. 2.1: Example of a factor-graph representing a joint measure of five 
variables (circles) and three factors (squares) 



2.2 AP and SCAP as min-sum algorithms 

The AP algorithm is a message-passing procedure proposed by Frey and Dueck 
[5] that performs a classification by identifying exemplars. It solves the following 
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optimization problem 

c* = argmin(£'[c]) , 

with 

JV JV 

E[c] ^(*' ^■') - E (2-2) 

i—l M— 1 

where c = (ci, . . . , cn) is the mapping between data and exemplars, S{i,Ci) is 
the similarity function between i and its exemplar. For datapoints embedded in 
an Euclidean space, the common choice for S is the negative squared Euclidean 
distance. A free positive parameter is given by 

s —S{i, i), Vi, 

the penalty for being oneself exemplar. x|f ■* [c] is a set of constraints. They read 

if 7^ fi, 3i s.t. Ci = fi, 
otherwise. 

p = is the constraint of the model of Frey-Dueck. Note that this strong constraint 
is well adapted to well-balanced clusters, but probably not to ring-shape ones. 
For this reason Leone et. al. |7j have introduced the smoothing parameter p. 
Introducing the inverse temperature /3, 

P[c]'='iexp(-/3£;[c]) 

represents a probability distribution over clustering assignments c. At finite f3 the 
classification problem reads 

c* = argmax(P[c]) . 

The AP or SCAP equations can be obtained from the standard BP equation [HE] 
as an instance of the Max-Product algorithm. For self-containess, let us reproduce 
the derivation here. The BP algorithm provides an approximate procedure to the 
evaluation of the set of single marginal probabilities {Pi{ci = fJ-)} while the min- 
sum version obtained after taking P —^ oo yields the affinity propagation algorithm 
of Frey and Dueck. The factor-graph involves variable nodes {i,i ^ 1 . . . N} with 
corresponding variable Cj and factor nodes {/i, ^ = 1 . . . N} corresponding to the 
energy terms and to the constraints (see Figure [2?2|) . Let A^^i(ci) the message 
sent by factor ^ to variable i and Bi^^{ci) the message sent by variable i to node 
/X. The belief propagation fixed point equations read: 



A,^,{c, =c) = n B,^^^i^^i)x%c,}, c] (2.3) 



B,^^{c, =c) = n A^->^ic)e^'^''''^ (2.4) 



2.2 AP and SCAP as min-sum algorithms 



5 




Fig. 2.2: Factor graph corresponding to AP. Small squares represents the 
constraints while large ones are associated to pairwise contributions in the 



Once this scheme has converged, the fixed points messages provide a consistency 
relationship between the two sets of beliefs 

1 ^ 

hMcH} - c] = ^l[c]\[B,^,{c,) (2.5) 

^ 4=1 
1 ^ 

6,(c, = c) = - n ^M-ac]e'3^('^^) (2.6) 
The joint probability measure then rewrites 



^'EL *."-'(<=.) 

with Zb the normalization constant associated to this set of beliefs. In (|2.3p we 
observe first that 

A^^,"^ A^^,{c, = u^^l), (2.7) 

is independent of v and secondly that (ci = c) depends only on Bj^f^ [cj = /i) 
and on X^i/^^/j ~ '^)- This means that the scheme can be reduced to the 
propagation of four quantities, by letting 

Af_i^i = A^^i(yCi = //), 

J dcf 1 ^ A^^i 
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which reduce to two types of messages A^^i and Bi—,^. The behef propagation 
equations reduce to 



A, 



B, 



p + (1 - p)B,,^^ + (TV - 1) [p + (1 - p){B^^^ + rij^. B, 
1 



1 + {N -l)[p+{l-p) Ylj^.B,^, 
1 

1 + (AT - 1) Y.u^^. f^eP(s(^..)-s(^,^.)) 



while the approximate single variable belief reads 

1 A 



Pi{Ci = fJ.) = 



Z., A 



^^^^ 



This simplification here is actually the key-point in the effectiveness of AP, because 
this let the complexity of this algorithm, which was potentially iV^ becomes N'^, 
as will be more obvious later. At this point we introduce the log-probability ratios, 



1 



■log 



/3 



B, 



(i °\B,^. 

corresponding respectively to the "availability" and "responsibility" messages of 
Frey-Dueck. At finite /? the equation reads 



+ (1 - e-''«)e^'^''-'' 



-pai- 



-fiq 



with q'= —jj log p. Taking the limit /? — oo at fixed q yields 



if 0, max(-g, min(0, r^^^)) + max(0, rj^^^j] , ^J■7^^, (2. 



Qi^i = minf^g, ^ max(0, r^^i) 

Ti^^ = S'(i,M) - maxfoi,^,, 

After reaching a fixed point, exemplars are obtained according to 
c* = argmax(S'(i,/i) + a^^,) = argmax(ri_+^ + a^^,;) 



(2.9) 

(2.10) 

(2.11) 
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Altogether. I2.8I2.9|2.10I and 12.111 constitute the equations of SCAP which reduce 
to the equations of AP when q tends to —oo. 

3 Hierarchical affinity propagation (Hi-AP) 
3.1 Weighted affinity propagation (WAP) 

Assume that a subset 5 C £ of n points, assumed to be at a small average mutual 
distance e arc aggregated into a single point c e 5. The similarity matrix has to 
be modified as follows 

S{c,i) — >nS{c,i), \fi ^ S 

S(i, c) — > S(i, c), Vi e 5 

S{c,c) — > ^S{i,c), 

and all lines and columns with index i £ '5\{c} are suppressed from the similarity 
matrix. This type of rules should be applied when performing hierarchies. 




Fig. 3.3: 

This redefinition of the self-similarity yields a non-uniform penalty coefficient. In 
the basic update equations (|2.8p . (|2.9p . (|2.10p and (|2.1ip . nothing prevents from 
having different self-similarities because while the key property (|2.7p for deriving 
these equations is not affected by this. When comparing in ()2.2p . the relative contri- 
bution of the similarities between different points on one hand, and the penalties on 
the other hand, we immediately see that s has to scale like the size of the dataset. 
This insures a basic scale invariance of the result, i.e. that the same solution is 
recovered, when the number of points in the dataset is rescalcd by some arbitrary 
factor. Now, if we deal directly with weighted data points in an Euclidean space, 
the preceding considerations suggests that one may start directly from the following 
rescaled cost function: 

1 " 

c— 1 iGc 

Z is a normalization constant 

dcf \ - 

z ^ 2^w,, 

The similarity measure has been specified with help of the Euclidean distance 



d{i,j) = \v,-r,\, V(z,j)e52. 
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3 Hierarchical afFinity propagation (Hi-AP) 



The {wiyWi € 5} is a set of weights attaehed to each datapoint and the self-similarity 
has been rescaled uniformly 

s ^ - ^ s. 

with respect to density of the dataset. V being the volume of the embedding space, 
for later purpose (thermodynamic limit in Section j^. 



3.2 Complexity of Hi-AP 



AP computational complexitjQ is expected to scale like 0{N'^); it involves the 
matrix S of pair distances, with quadratic complexity in the number N of items, 
severely hindering its use on large-scale datasets. 

This AP limitation can be overcome through a Divide-and-Conquer heuristics in- 
spired from jTTj- Dataset £ is randomly split into b data subsets as shown on Fig- 
ure l3.41 AP is launched on every subset and outputs a set of exemplars; the exemplar 
weight is set to the number of initial samples it represents; finally, all weighted ex- 
emplars are gathered and clustered using WAP (the complexity is 0{N^^^) [13]). 
This Divide-and-Conqucr strategy — which could actually be combined with any 
other basic clustering algorithm — can be pursued hierarchically in a self-similar 
way, as a branching process with b representing the branching coefficient of the 
procedure, defining the Hierarchical AP (Hi-AP) algorithm. 

Dataset 



h = 



WAP 



h = l 



h = 2 



WAP 



WAP 



Exemplars 



Fig. 3.4: 



Formally, let us define a tree of clustering operations, where the number h of suc- 
cessive random partitions of the data represents the height of the tree. At each level 
of the hierarchy, the penalty parameter is set such that the expected number of 
exemplars extracted along each clustering step is upper bounded by some constant 
K. 



^ Except if the similarity matrix is sparse, in which case the complexity reduces to 
Nklog{N) with k the average connectivity of the similarity matrix [S]. 
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Proposition 3.1. Letting the branching factor b to 
then the overall complexity C'{h) o/Hl-AP is given by 

C{h) (X kt^n'^ n:$>k. 

Proof. M = N/b^ is the size of each subset to be clustered at level h\ at level 
h—1, each clustering problem thus involves bK = M exemplars with corresponding 
complexity 

M 2 

The total number Ncp of clustering procedures involved is 

1=0 

with overall computational complexity: 

AT 2 _ 1 AT ''+2 

^ ' ^k' (f - 1 ^»-^ ^ 

It is seen that C(0) ^ N^, C(l) cx N^^^,. . . , and C{h) cx for /i > 1 . ■ 



3.3 Information loss of Hi-AP 

Let us examine the price to pay for this complexity reduction. As mentioned 
earlier on, the clustering quality is usually assessed from its distortion, the sum of 
the squared distance between every data item and its exemplar: 

N 
i=l 




Exemplar 



Center of Mass 



Fig. 3.5: 



The information loss incurred by Hi-AP w.r.t. AP is examined in the simple case 
where the data samples follow a centered distribution in JM^ . By construction, AP 
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aims at finding the cluster exemplar rc nearest to the center of mass of the sample 
points noted rem- 

D{c) = |r„„ -rc\^ + Cst 

To assess the information loss incurred by Hi-AP it turns out to be more convenient 
to compare the results in distribution. This can be done by considering e.g. the 
relative entropy, or KuUback Leibler distance, between the distribution Pq of the 
cluster exemplar computed by AP, and the distribution Pc{h) of the cluster exemplar 
computed by Hi-AP with hierarchy-depth h: 

DKL{Pc\\Pcih)) = J Pcik^{r)\og^j^dr (3.2) 



In the simple case where points are sampled along a centered distribution in M'', 
let fc denote the relative position of exemplar rc with respect to the center of mass 

f C = rc - Tern 

The probability distribution of f c conditionally to Vcm is cylindrical; the cylinder 
axis supports the segment (0, rem), where is the origin of the d-dimensional space. 
As a result, the probability distribution of rcm+ fc is the convolution of a spherical 
with a cylindrical distribution. 

Let us introduce some notations. Subscripts sd refers to sample data, ex to the ex- 
emplar, and cm to center of mass. Let x, denote the corresponding square distances 
to the origin, /. the corresponding probability densities and F, their cumulative dis- 
tribution. Assuming 

/■OO 

a = E[xsd] = / xfsd{x)dx, (3.3) 







and 

def 



a - hm (3.4) 

exist and are finite, then the cumulative distribution of Xcm of a sample of size AI 
satisfies 



lim Fcm{ — ) 



^ ^ V2' 2crJ 



M~'oo^^'"'M' r(|) 

by virtue of the central limit theorem. In the meanwhile, x^= |rea; — rcmp has 
a universal extreme value distribution (up to rescaling, see e.g. |18j for general 
methods): 

M^oo^^^AlVd^'' " exp(-aa;5). (3.5) 

To see how the clustering error propagates along with the hierarchical process, one 
proceeds inductively. At hierarchical level h, M samples, spherically distributed 
with variance cr*-''^ are considered; the sample nearest to the center of mass is 
selected as exemplar. Accordingly, at hierarchical level h+1, the next sample data 
is distributed after the convolution of two spherical distributions, the exemplar and 
center of mass distributions at level h. The following scaling recurrence property 
(proof in appendix) holds: 



3.3 Information loss of Hi-AP 
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Proposition 3.2. 



V2' 2cr(''+i> ' 



d=l , 7=1 



exp(— a2''^"^^a;) d = 2,7=1 



d>2,7=^ 



with 



f/i+i) 



/i follows that the distortion loss incurred by Hl-AP does not depend on the hier- 
archy depth h except in dimension d ~ 2. 



Proof. See appendix El 




Fig. 3.6: Radial distribution plot of exemplars obtained by clustering of 
Gaussian distributions of = 10^ samples in H'^ in one single cluster exem- 
plar, with hierarchical level h ranging in 1,2,3,6, for diverse values of d: d = 1 
(upper left), d = 2 (upper right), d = 3 (bottom left) and d = 4 (bottom 
right). Fitting functions are of the form f{x) = Cx'^/^~^ exp(— arr"^/^). 

Figurc lX^ shows the radial distribution of exemplars obtained with different hierarchy- 
depth h and depending on the dimension d of the dataset. The curve for /i = 1 
corresponds to the AP case so the comparison with h > 1 shows that the informa- 
tion loss due to the hierarchical approach is moderate to negligible in dimension 
d ^ 2 provided that the number of samples per cluster at each clustering level 
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is "suffLcicnt" (say, M > 30 for the law of large numbers to hold). In dimension 
d > 2, the distance of the center of mass to the origin is negligible with respect to 
its distance to the nearest exemplar; the distortion behaviour thus is given by the 
WeibuU distribution which is stable by definition (with an increased sensitivity to 
small sample size M as d approaches 2). In dimension d = 1, the distribution is 
dominated by the variance of the center of mass, yielding the gamma law which is 
also stable with respect to the hierarchical procedure. In dimension d ~ 2 however, 
the WeibuU and gamma laws do mix at the same scale; the overall effect is that the 
width of the distribution increases like 2^*, as shown in Fig. 13.61 (top right). 

We can also compute the corrections to this when A/ is finite. We have the following 
property which is valid for (non necessarily spherical) distributions of sample points, 
with a finite variance a. 

Proposition 3.3. For d > 2, at level h, assume that 

= P,fA0)^, (3.6) 
with Q,d = 27r''/^/r(rf/2) the d-dimensional solid angle, and 



are both finite. Defining the shape factor of the distribution by 



,{h} 



dcf o"- 'a'' 



r(i + i) 

the recurrence then reads, 



(3.7) 



.^-^)=.^'^)(l + ^^4^^)+o(M^/-). 



Ml 



(3.8) 



for h > and 



Proof. See appendix IB] Note that the definition qC') is equivalent to the previous 
definition (|3.4p if the distribution of sample points is regular (as well as for the 
variance cr^'*) obviously). ■ 

As a result, for /i = 1 we obtain 
and thereby for h> 1, 
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This means that the mean error within the hierarchical procedure compared to 
the expected error scales like iVra-h. From definition (j3.7|l and (|3.6p . the relation 
between the shape factor u; and the variance a and the density po near the center 
of mass reads, 

dof 1,2/ ^ 2/d ^ lid 

As its name indicates, it depends on the shape of the clusters. It relates the variance 
of the cluster to its density in the vicinity of the center of mass. In what follows it 
will be useful to keep in mind the following particular distributions in dimension d: 

Gaussian distribution 

uniform L2-sphere distribution 

uniform Ll-spherc distribution. 



LO = < 



2r(i + |)' 

d 1 

rf + 2r(i + |)' 

^ [2) 



while by our definition it is equal to unity for the WeibuU distribution 13.51 vielded 
by the clustering procedure. In addition, for a superposition of clusters with iden- 
tical forms and weights, represented by a distribution of the type ()4.ip . by simple 
inspection, the shape factor of the mixture is strictly greater than the single cluster 
shape factor, a soon as two cluster do not coincide exactly. 



4 Renormalized affinity propagation (RAP) 

The self-consistency of the Hi-AP procedure may be exploited to determine the 
underlying number of clusters present in a given data set. We follow here the 
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guideline of the standard renormalization (or deeiniation) procedure which is used 
in statistical physics for analyzing scaling properties. 

basic scaling 



M data subsets 




Fig. 4.8: 

Consider first a dataset, composed of items, occupying a region of total volume 
y. of a rf-diniensional space, distributed according to a superposition of localized 
distributions, 

/w-^i:-r*/o(^), (4.1) 

c— 1 ^ ^ 

where n* is the actual number of clusters, /o is a distribution normalized to one, 
Fc is the center of cluster c and cr* the variance. Assume the data set is partitioned 
into n = xV clusters, x representing the density of these clusters, each cluster 
containing Nc points and occupying an effective volume Vc- The energy p.ip of the 
clustering reads for large N and n << N 

c— 1 i^c 

= a{x) + xs, (4.2) 

where 

^ xV 

c— 1 zGc 

xV 

= Y,^c(Tc, (4.3) 

c=l 

is the distortion function, with Uc = Nc/N is the fraction of data-point and ac the 
corresponding variance of the AP-cluster c: 
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by virtue of the law of large numbers. If Vc represents the effective volume of such 
a cluster, we have 

where ac is a dimensionlcss quantity. For n ^ n* we expect Vc — V/n and (Jc = 
(all AP-cluster have same spherical shape in this limit) so that 

E[c] - x-^/'^a + xs. 
For a given value of the optimal clustering is obtained for 

s«s*. 

self-consistency 

Consider now a one step Hi-AP (sec Figure 14. 8p where the A^-size data set is 
randomly partitioned into M — 1/X subsets of XN points each and where the 
reduced penalty s is fixed to some value such each clustering procedure yields n 
exemplars in average. The resulting set of data points is then of size n/X and the 
question is how to adjust the value s^^^^ for the clustering of this new data set, in 
order to recover the same result as the one which is obtained with a direct procedure 
with penalty s. To answer to this question we make some hypothesis on the data 




Fig. 4.9: 

set. 

• (HI): the distribution of data points in the original set consists in a super- 
position of n* non-overlapping distributions, with common shape factor uj 

• (H2): there exists a value s* of s for which AP yields the correct number of 
clusters C when N tends to infinity. 

• (H3): (t{x) which represents the mean square distance of the sample data 
to their exemplars in the thermodynamic limit, is assumed to be a smooth 
decreasing convex function of the density x = n/V of exemplars (obtained 
by AP) with possibly a cusp aX x = x* (see Figure H.lOp . 
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numbers of clusters number of clusters 



Fig. 4.10: The distortion function for various values of rj for d = 5 and 
n* = 10,30 (left panel). The energy (i.e.the distortion plus the penalties) as 
a function of the number of clusters obtained at each hierarchical level of a 
single Hi-AP procedure for d = 5, n* = W, r] = 0.85, h = 2 and XN = 300 
(right panel). 



The first hypothesis essentially amounts to assume that the clusters are "suffi- 
ciently" separated with respect to their size distribution. This can be characterized 
by the following parameter 

def dmin / . ^ \ 

V = 17^ , (4.4) 

where dmin is the minimal mutual distance among clusters (between centers) and 
Rmax is the maximal value of cluster radius. We expect a good separability prop- 
erty for ?7 > 1. In practice this means the following. When s is slowly increased, 
so that the number of cluster decreases unit by unit, the disappearing of a cluster 
corresponds either to some "true" cluster to be partitioned in on unit less, or either 
to two "true" clusters that are merged into a single cluster. The assumption (H2) 
amounts to say that the cost in distortion caused by merging two different true 
clusters is always greater than the cost corresponding to the fragmentation of one 
of the true clusters. As a result, starting from small values of s, and increasing 
it slowly, one is witnessing in the first part of the process a decrease in the frag- 
mentation of the true clusters until some threshold value s* is reached. At that 
point this de-fragmentation process ends and is replaced by the merging process of 
the true clusters (see Figure l479l) . In the thermodynamic limit (which sustain all 
the present considerations), s* can be viewed as a critical value corresponding to 
a (presumably) second order phase transition, which separates a coalescent phase 
from a fragmentation phase. Note that when the second hypothesis (H2) is satis- 
fied, this implied that (HI) is automatically satisfied by the dataset obtained after 
the first step of Hi-AP performed at s* . Indeed, each partition of the initial set 
yields exactly one exemplar per true cluster in that case, and the rescalcd distri- 
bution of these exemplars w.r.t their "true" center are universal after rescaling. 
The renormalization setting is depicted in Figure 14.81 The dataset is partitioned 
into M = 1/A subsets. We manage that the size \N of each subset remain the 
same at each stage of the procedure. This means that A is set to A^n, if n is the 
expected number of exemplars obtained at the corresponding stage. Under the two 
assumptions (HI) and (H2), the second step of Hi-AP, which corresponds to the 
clustering of An/A data points with a penalty set to Nns{X)/X is expected to have 
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Coalescence ^* Fragmentation 



Fig. 4.11: Sketch of the rescaling property. Comparison of the distortion 
function between two stages of Hi-AP (left panel). Corresponding result in 
terms of the number of clusters as a function of s. 



the following property. 

Proposition 4.1. Let ni [resp. 712/ the number of clusters obtained after the first 
[resp. second] clustering step. If 

5(A) ^ L/!lAy-^ s^ — s, with — « 1, (4.5) 

then either cases occurs, 



s < s* 


then 


"2 


> ?? i 


> n* 


s = s* 


then 


"-2 


= ni 


^ n* 


s > s* 


then 


?^2 


ni 


< n* 



Proof. In the thermodynamic limit the value ni for 71, which minimizes the energy 
is obtained for xi = ni/V as the minimum of 



s + cr'(xi) = 0. 

At the second stage one has to minimize with respect to x, 

EW[c] = ^[^aWix,x,) + xs_, 

where tr'-^'' (x, y) denotes the distortion function of the second clustering stage when 
the first one yields a density y of clusters. This amounts to find x = X2 such that 

We need now to see how, depending on x, 

4^'(?/)'"A-^/V^)(y,x) 
compares with a{y). This is depicted on Figure B. Ill ■ 
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4 Renormalized affinity propagation (RAP) 



Assume first that xi — x* , which obtained if we set s = s* in the first clustering 
stage. This means that each cluster which is obtained at this stage is among the 
exact clusters with a reduced variance, resulting from the extreme value distribution 
properties p.Sp combined with definition p.7p of the shape factor oj: 

4^^ = -i = a,. 4.7 

a; V m / uj 

Note at this point that 

» 1, 



X2/d 

is required to be in the conditions of getting a cluster shaped by the extreme 
value distribution. For y > x*, the new distortion involves only the inner cluster 
distribution of exemplars which is simply rescaled by this (xi/X)'^^'^ factor, so from 
(|4.3p we conclude that 

^i^\y) = ^iy): for y>x*. 

Instead, for y < x*, the new distortion involves the merging of clusters, which inter 
distances, contrary to their inner distances, are not rescaled and are the same as in 
the original data set. This implies that 

—7^{y) < <^'iy), for y < x*. 
ay 

As a result the optimal number of clusters is unchanged, yi = x* . 

For xi < X* , which is obtained when s > s* , the new distribution of data points, 
formed of exemplars, is also governed by the extreme value distribution, and all 
cluster at this level are intrinsically true clusters, with a shape following the WeibuU 
distribution. We arc then necessary at the transition point at this stage: y* — xJl. 
In addition, the cost of merging two clusters, i.e. y slightly below xi, is actually 
greater now after rcscaling. 



dy 



{y)<cr'{y), for y={xi). 



because mutual cluster distances appear comparatively larger. Instead, for y slightly 
above xi, the gain in distortion when y increases is smaller, because it is due to 
the fragmentation of Weibull shaped cluster, as compared to the gain of separating 
clusters in the coalescence phase et former level, 

-^(y) > o-'(y), for y = (xi)+. 

As a result, from the convexity property of cr^^^(?/), we then expect again that the 
solution of (j4.6p remains unchanged yi = xi in the second step with respect to the 
first one. 



^ Fluctuations are neglected in this argument. In practice the exemplars which emerge 
from the coalescence of two clusters might originate from both clusters, when considering 
different subsets, if the number of data is not sufficiently large. 
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Finally, for xi > x* , the new distribution of data points is not shaped by the 
extreme value statistics when the number of fragmented clusters increases, because 
in that case the fragments are distributed in the entire volume of the fragmented 
cluster. In particular, 

^x\Hy) - -^i^iv)^ when xi » x* . 

The rescaling effect vanishes progressively when we get away from the transition 
point, so we conclude that the optimal density of clusters yi is displaced toward 
larger values in this region. 

Wc have tested this renormalizcd procedure, by generating artificial sets of data- 
points in various situations, including different types of cluster shape. Some sample 
plots arc displayed in Figure 14.121 and 14.131 to illustrate the preceding proposi- 
tion 14.11 The self-similar point is clearly identified when plotting the number of 
clusters against the bare penalty, when 77 is not to small. As expected from the 
scaling (|4.5p . the effect is less sensible when the dimension increases, but remains 
perfectly visible and exploitable at least up to d = 30. The absence of information 
loss of the hierarchical procedure can be seen on the mean-error plots, in the region 
of s around the critical value s* . The results are stable, when we take into account 
at the first stage of the hierarchical procedure the influence of the shape of the 
clusters. This is done by fixing the value of the factor form lo to the correct value. 
In that case, at subsequent levels of the hierarchy the default value uj — 1 is the 
correct one to give consistent results. Nevertheless if the factor form is unknown 
and set to false default value, the results are spoiled at subsequent levels, and the 
underlying number of clusters turns out to be more difficult to identify, depending 
on the discrepancy of lo with respect to its default value. We see also that the 
identification of the transition point is still possible when the number of datapoints 
per cluster get smaller (down to 6 in these tests). 

5 Discussion and perspectives 

The present analysis of the scaling properties of AP, within a divide-and-conquer 
setting gives us a simple way to identify a self-similar property of the special point 
s*, for which the exact structure of the clusters is recovered. This property can 
be actually exploited, when the dimension is not too large and when the clusters 
are sufficiently far apart and sufficiently populated. The separability property is 
actually controlled by the parameter 77 introduced in 14.41 For > 1 the underlying 
cluster structure is recovered, and in the vicinity of s* , the absence of information 
loss, deduced from the single cluster analysis is effectively observed. The search of 
the exact number of cluster could then be turned into a simple line-search algorithm 
combined with RAP. This deserves further exploration, in particular from the ap- 
plication point of view, on real data and for clusters of unknown shape. From the 
theoretical viewpoint, this renormalization approach to the self-tuning of algorithm 
parameter could be applied in other context, where self-similarity is a key property 
at large scale. First it is not yet clear how we could adapt RAP to the SCAP con- 
text. The principal component analysis and associated spectral clustering provide 
other examples, where the fixing of the number of selected components is usually 
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not obtained by some self-consistent procedure and where a similar approach to the 
one presently proposed could be used. 
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A Proof of Proposition 2.2 



A Proof of Proposition 2.2 



The influence between the center of mass and extreme value statistics distribution 
corresponds to corrections which vanish when M tends to infinity (see Appendix IbI 
Neglecting these corrections, enables us to use a spherical kernel instead of cylindri- 
cal kernel and to making no distinction between ex and ex, to write the recurrence. 
Between level h and /i + 1, one has: 

/•oo 

f^'/'\-)= K('^'''H-,y)ft''Hy)dy (A.i) 

Jo 

where K{x, y) is the d-dimcnsional radial diffusion kernel, 

j-^, N dcf 1 2^ , , _£+y 

K[x,y) ^ -X i y i I±_i[Vxy)e ^ . 

with Id_^ the modified Bessel function of index d/2 — 1. The selection mechanism 
of the exemplar yields at level h, 



Ft'-'Hx) ^ {F^,^\x)) 
and with a by part integration, (|A.1|1 rewrites as: 
fi'/'\x) = K('^'''\x,0) + 



M 



[Kd iy)> — — {x,y)dy, 



with 



lim M-iA'('''^^)( — ,0) 



d , dx / dx s 



At this point the recursive hierarchical clustering is described as a closed form 
equation. Proposition 13.21 is then based on (|A.2p and on the following scaling 
behaviors. 



lim ^cxp(-aWx^ 

M d 



so that 



lim — ) = lim M^--^ / 



dy 



Basic asymptotic properties of Id/2-1 yield with a proper choice of 7, the non 
degenerate limits of proposition 13.21 In the particular case d — 2, taking 7 = 1, it 
comes: 

/>oo />oo 

1™ ^ir'^(77)== / dy duf^^J{a^^k)Kiu,y) 
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with help of the identity 

1 //3\2i/ ^ 



Again in the particular case c? = 2, by virtue of the exponential law one further has 
(y{h) — l/fjCi)^ finally yielding: 

^{h+D ^ (A 3) 



B Finite size corrections 

We consider a given hierarchical level ft-, r denotes sample points, v^m their cor- 
responding center of mass, and rc the exemplar, which in turn becomes a sample 
point at level /i + 1. We have 

p;-^-^(r)d'^r = P(rced''r) 



(/i+i 

We analyse this equation with the help of a generating function: 

(/).(A) = / d''rp.(i-)e 



A/-1 



where . may be indifferently sc?, c or cm and Ar is the ordinary scalar product be- 
tween two d-dimensional vectors. Let A = |A|, by rotational invariance, p. depends 
only on r and 0. depends solely on A, so we have 

g.(A)^=^'log(0.(A))=log(2^'^/2^°°drr'^-VW(Y)'"'%2-i(Ar)). 

The joint distribution between and Vcm takes the following form 

r 

where by definition Pcm\sd is the conditional density of Vcm to r^rf, with 

gcm\sd{X)^iM-i)g.d{^), (B.i) 

while 

5,„(A) (B.2) 
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B Finite size corrections 



where Qsd is assumed to have a non zero radius Taylor expansion of the form 

„(h) °° „i2n)(n\ 

ri=2 

since by rotational symmetry all odd powers of A vanish and where ti^''^ represents 
the variance at level h of the sample data distribution. In addition the conditional 
probability density of Vsd to Vcm reads 

P sd\cm\^ T ^ cm ) — / \Pcm\sd\\^cm ]\/f\^ — Vsd\cin\^i" I'^cm} 

Pcm ycm ) ^^'^ 

where u = r — Ycm and 6 is the angle between u and Vcm ■ Let 

/(w, Tern) =■ ^(^sd - ^ cm\ > w|rcm)- 

We have 

t^U /'IT 

/(u,r„„) = 1 - ild-i / dccx'^^W d9 sin 6''^^'^ p^^a\cm{x,0, rem)- 
Jo Jo 

with 

n 

'~r(i)' 

the (i-dimensional solid angle. Let 

h{u, rem) log(/(M, rem))- 

We have 

Pl!d^^\^) ==-PidV) / c^'^WPcmlsddrcm " -^|)exp((M- l)h{\r-rcm\,rcm))- 



From the expansion (jB.3|l wc see that corrections in gem and gcm\sd to the Gaussian 
distribution are of order a cm = (yjM as expected from the central limit 

theorem and (Jcm\sd = (-^^ ^ l)cr/M^. Letting y = rem — r we have 

^'i'r^'M -^'i'^'(0)(^)''7 rf^y oxp(-M^(-)(r,y) 



with 



,(A/)^ N def d, M dr^ pid^O 



2(M-- l)cr('^) 



|y + r|2 + (M-l)/i(2/,|y + r|). 



As observed previously P^'^^^^ [r/M^/'^) converges to a WeibuU distribution when M 
goes to infinity, and the corrections to this are obtained with help of the following 
approximation: 
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with 



Ah+l), 



As a result, computing the normalization constant p\,^ (0) and the corresponding 
variance (T^''+^\ yields the following recurrence relations: 

...... ra.H)o<«-(:.£;^-^).„(M-.). 



Letting 



r(i + i) ' 



we get 



Consequently, for ft, = 0, we have 
while for /i > 1 we get 
For h = 1 this reads 
and thereby 



^C^+i) =^W+o(Af2/rf-i), for/i> 1. 
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B Finite size corrections 




Fig. 4.12: Number of cluster obtained in one single run with respect to 
the hierarchical level (left panel). Error distance of the exemplars from the 
true underlying centers obtained with respect to the hierarchical level (right 
panel). In all cases, 10 underlying L2-sphere shaped clusters are present 
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Fig. 4.13: Number of clusters and mean-error as a function of s for Ll- 
sphere (first and second rows) and L2-sphere (third and fourth rows) shaped 
cluster. 




Fig. 4.14: Number of clusters and mean-error as a function of s for L2- 
sphere shaped in d = 3, 5, 10, 20 with n* = 10, 30, 10. 



